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ABSTRACT 

Context. The Richardson-Lucy method is the most popular deconvolution method in astronomy because it preserves the 
number of counts and the non-negativity of the original object. Regularization is, in general, obtained by an early stopping of 
Richardson-Lucy iterations. In the case of point-wise objects such as binaries or open star clusters, iterations can be pushed to 
^ J convergence. However, it is well-known that Richardson-Lucy is an inefficient method. In most cases and, in particular, for low 
noise levels, acceptable solutions are obtained at the cost of hundreds or thousands of iterations, thus several approaches to 
accelerating Richardson-Lucy have been proposed. They are mainly based on Richardson-Lucy being a scaled gradient method 
for the minimization of the KuUback-Leibler divergence, or Csiszar I-divergence, which represents the data-fidelity function in 
the case of Poisson noise. In this framework, a line search along the descent direction is considered for reducing the number of 
iterations. 

Aims. A general optimization method, referred to as the scaled gradient projection method, has been proposed for the 
constrained minimization of continuously differentiable convex functions. It is applicable to the non-negative minimization 
of the Kullback-Leibler divergence. If the scaling suggested by Richardson-Lucy is used in this method, then it provides a 
considerable increase in the efficiency of Richardson-Lucy. Therefore the aim of this paper is to apply the scaled gradient 
projection method to a number of imaging problems in astronomy such as single image deconvolution, multiple image 
deconvolution, and boundary effect correction. 

Methods. Deconvolution methods are proposed by applying the scaled gradient projection method to the minimization of the 
Kullback-Leibler divergence for the imaging problems mentioned above and the corresponding algorithms are derived and 
implemented in interactive data language. For all the algorithms, several stopping rules are introduced, including one based 
on a recently proposed discrepancy principle for Poisson data. To attempt to achieve a further increase in efficiency, we also 
consider an implementation on graphic processing units. 

Results. The proposed algorithms are tested on simulated images. The acceleration of scaled gradient projection methods 
achieved with respect to the corresponding Richardson-Lucy methods strongly depends on both the problem and the specific 
object to be reconstructed, and in our simulations the improvement achieved ranges from about a factor of 4 to more than 
30. Moreover, significant accelerations of up to two orders of magnitude have been observed between the serial and parallel 
implementations of the algorithms. The codes are available upon request. 
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1. Introduction 



The Richardson-Lucy (RL) algorithm (Richardson 119721 
Lucy I1974P is a renowned iterative method for image de- 
convolution in astronomy and other sciences. Here, we de- 
fine g to be the detected image and A the imaging matrix 
given by Af = K * f, where K is the point spread func- 
tion (PSF) and * denotes a convolution. If the PSF is then 



normalized to unit volume, the iteration, as modified by 
Snyder fSnvder fTMO)) . is 



/(^) o A'' 



Af 



(fc) 



(1) 



where A'^ is the transposed matrix, 6 is a known array 
representing background emission, x oy denotes the pixel 
by pixel product of two equally-sized arrays x, y, and x/y 
their quotient. 
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It is well-known that the method has several interest- 
ing features. The result of each iteration is non-negative 
and robust against small errors in the PSF, and that flux 
is conserved both globally and locally if b = 0. 

In such a case, it has also been proven by several au- 
thors (see, for instance, Natterer & Wiibbeling l200ip that 
the iterations converge to either a maximum likelihood 
solution for Poisson data (Shepp & Vardi fTOS^ or, equiv- 
alently, to a minimizer of the Kullback-Leibler (KL) di- 
vergence, which is also known as the Csiszar I-divergence 
fCsiszar ri99ip . given by 



+ (A/)(m) + b(m)-g(m)} , 



(2) 



where S is the set of values of the multi-index m labeling 
the image pixels. 

As shown in Barrett & Meyers (|2003p . the non- 
negative minimizers of Jo{f; g) are sparse objects, i.e. they 
consist of bright spots over a black background. Therefore, 
in the case of simple astronomical objects, such as binaries 
or open star clusters, the algorithm can be pushed to con- 
vergence (examples are given in Sect. 4), while, in the case 
of more complex objects, an early stopping of the itera- 
tions, providing a "regularization effect" , is required. The 
problem of introducing suitable stopping rules is briefly 
discussed in Sect. 3. 

The main disadvantage of the RL algorithm is that it 
is not very efficient: it may require hundreds or thousands 
of iterations for images with a large number of counts (low 
Poisson noise) . In the case of large-scale images or multiple 
images of the same target, the computational cost can 
become prohibitive. For this reason, several acceleration 
schemes have been proposed, of which we mention a few. 

The first is the "multiplicative relaxation" proposed 
by Llacer & Niiiiez (jl990p , which consists in replacing the 
iteration of Eq. ^ by 



A' 



Af 



(fe) 



(3) 



with a > 1 . Convergence is proved in lusem (I199ip for a < 
2. As demonstrated in Lanteri et al (I200ip . this approach 
can provide a reduction in the number of iterations by a 
factor of a, with essentially the same cost per iteration. 
For low numbers of counts numerical convergence has been 
found also for a > 2 (Anconelli et al. I2005p . A "linear 
relaxation" is investigated in Adorf et al. (|1992p . It can 
be written in the form 



c(fc+l) 



fc) 



A' 



Af 



(fe) 



(4) 



where Afe > 1 (for = 1 the RL algorithm is re-obtained) 
and 1 is the array with all entries equal to 1. Since the 
quantity in brackets is the gradient of Jo(/;9)j we note 
that RL is a scaled gradient method with a scaling given 
by f^'^^ at iteration k, and that the relaxation method is 



essentially a line search along this descent direction, which 
can be performed by minimizing the objective function 
Jo{f',g) (Adorf et al. I1992p or applying the Armijo rule 
(Lanteri et al. I200ip . A moderate increase in efficiency is 
then observed by these authors. The values reached after 
convergence of the algorithms can be inferred from general 
results of optimization theory (Bertsckas l2003p . Finally, a 
greater increase in efficiency on the order of ten, is ob- 
served using an acceleration method proposed by Biggs & 
Andrews (|1997p . which exploits a suitable extrapolation 
along the trajectory of the iterates, and is implemented in 
the deconvlucy function of the Image Processing MATLAB 
toolbox. The problem with this method is that no conver- 
gence proof is available and, in our experience, a devi- 
ation from the trajectory of RL iterations is sometimes 
observed, providing unreliable results. 

Bonettini et al. (|2009p developed an optimization 
method, which they called scaled gradient projection 
(SGP) method, to constrain the minimization of a con- 
vex function, and proved that its convergence occurs un- 
der mild conditions. This method can be quite naturally 
applied to the non-negative minimization of the KL di- 
vergence, using the scaling of the gradient suggested by 
RL, hence this application of SGP can also be consid- 
ered as a more efficient version of RL. In Bonettini et al. 
(|2009p . the performance of the new method is compared 
with that of RL and the Biggs & Andrews method, as 
implemented in MATLAB, providing an improvement in 
efficiency comparable to that of the latter method, but 
sometimes better and without its drawbacks. Further ap- 
plications of SGP in image restoration problems can be 
found e.g. in Benvenuto et al. 120101 Bonettini & Prato 
[^OTUl and Zanella et al. IMISl 

The purpose of this paper is not only to illustrate 
the features of SGP to the astronomical community, but 
also to extend its application to the problems of multi- 
ple image deconvolution and boundary effect correction. 
The first problem is fundamental, for instance, to the re- 
construction of the images of the future interferometer 
of the Large Binocular Telescope (LBT), denoted LING- 
NIRVANA (Herbst et al. while the second prob- 
lem is important in both single and multiple image de- 
convolution. All the algorithms are implemented in inter- 
active data language (IDL) and the codes will be freely 
distributed. Moreover, we present an implementation for 
GPU (graphic processor unit) is also provided. In this pa- 
per, we consider only the constraint of non- negativity. 
Bonettini et al (|2009p investigated both non-negativity 
and flux conservation and provided an efficient algorithm, 
for computing the projection on the convex set defined 
by the constraints. However, their numerical experiments 
seem to demonstrate that the additional flux constraint 
does not significantly improve the reconstructions. 

The paper is organized as follows. In Sect. 2, after a 
brief description of the general SGP algorithm in the case 
of non-negativity constraint, we derive its application to 
the problems of both single and multiple image decon- 
volution and boundary effect correction. In Sect. 3, we 
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describe the IDL and GPU codes and in Sect. 4 we dis- 
cuss our numerical experiments illustrating the increase in 
efRciency achievable with the proposed methods. In Sect. 
5, we discuss possible implementation improvements and 
extensions to regularized problems. 

2. The deconvoiution methods 

We first describe the monotone SGP algorithm for min- 
imizing a convex and difFerentiable function on the non- 
negative orthant. For the general version of the algorithm 
including a flux constraint, we refer to Bonettini ct al. 
(|2009p . Next, we outline the application of SGP to the 
three imaging problems mentioned in the Introduction. 

2.1. The scaled gradient projection (SGP) method 

The SGP scheme is a gradient method for the solution of 
the problem 



mm Jo(/;£/) 



(5) 



where Jo{f]g) is a convex and continuously difFerentiable 
function defined for each one of the problems considered 
in this paper. Each SGP iteration is based on the descent 
direction d^^^ = 

(fc) ^ 



y 



y(.k) _ where 
-«fcI?fcVJo(/("^s)) 



(6) 



is defined by combining a scaled steepest descent direction 
with a projection on the non-negative orthant. The matrix 
Dk in Eq. © is chosen in the set V of the n x n diagonal 
positive definite matrices, whose diagonal elements have 
values between Li and L2 for given thresholds < ii < 

The main SGP steps are given in algorithm [TJ The 
global convergence of the algorithm is obtained by means 
of the standard monotone Armijo rule in the line-search 
procedure described in step 5 (see Bonettini et al. I2009p . 

We emphasize that any choice of the steplength ak S 
[ctmin J Olmax] ^nd the scaliug matrix G T) are allowed; 
this freedom of choice can then be fruitfully exploited for 
introducing performance improvements. 
An effective selection strategy for the steplength parame- 
ter is obtained by adapting to the context of the scaling 
gradient methods the Barzilai and Borwein (jl988p rules 
(hereafter denoted BB), which arc widely used in stan- 
dard nonscaled gradient methods. When the scaled direc- 
tion -DfcVJo(/^'^^;g) is exploited within a step of the form 
{f^^ - akDkVMf^^^\g)), the BB steplength rules be- 
come 

JBBl) _ s"^ > jJfc jJfc > 



(k-l) 



{BB2) 



(8) 



where 5(^-1) ^ /(fe)-/!/^-!) and ^C^-^) = V Jo(/('=) ; g)- 
V Jo(/^'^~^^ ;g). In SGP, we constrain the values produced 



Algorithm 1 Scaled gradient projection (SGP) method 

Choose the starting point /'■°' > and set the parameters 
13,9 e (0,1), 0< 

For fc = 0, 1, 2, ... DO THE FOLLOWING STEPS: 

Step 1. Choose the parameter £ [amin,amax] and the 

scaling matrix Dk £ T>; 
Step 2. Projection: 



Step 3. Descent direction: d^'"' = y'*' - 

Step 4. Set Xk = 1; 

Step 5. Bacfctracking loop: 

let J„e.„ = Jo(/W+AfcdW;g); 

If 



Jue» < Jo(/^"';g)+/?AfeVJo(/''';g)^d"=) 

then 

go to step 6; 
Else 

set Afc = 6\k and go to step 5. 
Endif 

Step 6. Set = /C'' + \kd^''K 



End 



by these rules into the interval [amimCtmax] in the follow- 
ing way: 

IF < THEN 



ELSE 



[3^ = min{10 ■ Q!fc_i, a„iax}; 



ENDIF 



3' = min^arnax, max I 



IF S 



C^'^y DkZ^'^-^^ <o 



THEN 



(2) 

= min{10 • Q!fc_i, ar,iax}; 



ELSE 
(2) 

ENDIF 



(2) 

ctf. ^ mm 



/ / (BB2)\\. 

■j ^niax^ max s O-jjiiyi, > >, 



The recent literature on steplength selection in gradi- 
ent methods propose that steplength updating rules be 
designed by alternating the two BB formulae (Serafini 
et al. 120051 Zhou et al. 12006^ . In the case of nonscaled 
gradient methods (i.e., Dk = I) where the inequality 
o'^^'^^ < Oi'^^^^ holds (Serafini et al. I2005p . remarkable 
convergence rate improvements have been obtained by al- 
ternation strategies that force the selection to be made 
in a suitable order of both low and high BB values. In 
Frassoldati et al. (j2008p . this aim is realized by an alter- 
nation criterion, which compares well with other popular 
BB-like steplength rules, namely 



(2) / (1) / 
IF /ctk < Tk THEN 



j=max{l,k+l-Ma}, 

Tfc+i 0.9 • Tk; 

ELSE 



v(2). 



(9) 
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(1) 1 1 

ENDIF 

where Ma is a prefixed positive integer and ri € (0,1). 
When scaled versions of the BB rules given in Eqs. ([7])- 
dHJ are used, the inequality a^f^^^ < a^f^^^ is not al- 
ways true. Nevertheless, a wide computational study sug- 
gests that this alternation criterion is more suitable in 
terms of convergence rate than the use of a single BB rule 
(Bonettini et al. [MM Favati et al. [^UTUl Zanella et al. 
I2009p . Furthermore, in our experience, the use of the BB 
values provided by Eq. (|9]) in the first iterations slightly 
improves the reconstruction accuracy and, consequently, 
in the proposed SGP version we start the steplength al- 
ternation only after the first 20 iterations. 

When selecting the scaling matrix Dk, a suitable up- 
dating rule generally depends on the special form of the 
objective function. In our case, we chose the scaling matrix 
suggested by the RL algorithm, i.e.. 



Dk = diag I min 



(10) 



where Li,L2 are prefixed thresholds. 
2.2. Single image deconvolution 

The problem of single image deconvolution in the presence 
of photon counting noise is the minimization of the KL 
divergence defined in Eq. ([2]) and the solution is given 
by the iterative RL algorithm of Eq. ([Ij . When applying 
SGP, we only need the expression of the gradient of the 
KL divergence, which is given by (when the normalization 
of the PSF to unit volume is used) 

g 



A' 



(11) 



Af + b 

The SGP behavior with respect to RL was previously in- 
vestigated in Bonettini et al. (|2009p . 

2.3. Multiple image deconvolution 

Successful multiple image deconvolution is fundamental 
to the future Fizeau interferometer of LBT called LINC- 
NIRVANA (Herbst et al. [MI5)) or to the "co-adding" 
method of images with different PSFs proposed by Lucy 
& Hook (dMH)- 

We define p to be the number of detected images Qj, 
(j=l,..,p), with corresponding PSFs Kj, all normalized 
to unit volume, and Ajf = Kj * /. It is quite natural 
to assume that the p images are statistically independent, 
such that the likelihood of the problem is the product of 
the likelihoods of the different images. If we assume again 
Poisson statistics, and we take the negative logarithm of 
the likelihood, then the maximization of the likelihood is 
equivalent to the minimization of a data-fidelity function, 
which is the sum of KL divergences, one for each image, 
i.e. 



Jo(/;s) = EEi9,(m)ln 



9,(m) 



3 = 1 mes 



(A,/)(m) + 6,(m) 



(12) 



Algorithm 2 Ordered subset expectation maximization 
(OSEM) method 



Choose the starting point > 



Step 1. Set /i™ = /C''); 

Step 2. For j = 1, ...,p compute 



h 



h 



(i-i) 



o A 



Step 3. Set /C'+i' = H'-pK 



(15) 



End 



+ (^,/)(m) + 6,(m)-g^.(m)} . 

If we apply the standard expectation maximization 
method (Shepp & Vardi [T982)) to this problem, we obtain 
the iterative algorithm 



(fc+i) 



^1/ 



(fe) 



p 

E 



Ai 



(k) 



(13) 



which we call the multiple image RL method (multiple 
RL, for short). Since the gradient of is given by 



E 



Ai 



(14) 



we find that the algorithm presented in Eq. ([13)) is a scaled 
gradient method, with a scaling given, at iteration fc, by 
f^^^ /p. Therefore, the application of SGP to this problem 
is straightforward. 

However, for the reconstruction of LING-NIRVANA 
images, we must consider that an acceleration of the al- 
gorithm in Eq. ([T^ is proposed in Bertero & Boccacci 
(j2000p by exploiting an analogy between the images of 
the interferometer and the projections in tomography. In 
this approach called OSEM (ordered subset expectation 
maximization, Hudson & Larkin ri994p . the sum over the 
p images in Eq. ([T^ is replaced by a cycle over the same 
images. To avoid oscillations of the reconstructions within 
the cycle, a preliminary step is the normalization of the 
different images to the same flux, if different integration 
times are used in the acquisition process. The method 
OSEM is summarized in algorithmic] 

As follows from practice and theoretical remarks, this 
approach reduces the number of iterations by a factor p. 
However, the computational cost of one multiple RL iter- 
ation is lower than that of one OSEM iteration: we need 
3p-|- 1 FFTs in the first case and Ap FFTs in the second. In 
conclusion, the increase in efficiency provided by OSEM 
is roughly given by {2>p + l)/4. When p = 3 (the number 
of images provided by the interferometer will presumably 
be small), the efficiency is higher by a factor of 2.5, and a 
factor of 4.7 when p = 6. These results must be taken into 
account when considering the increase in the efficiency of 
SGP with respect to multiple RL. We can add that the 
convergence of SGP is proven while that of OSEM is not. 
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even if it has always been verifieci in our numerical exper- 
iments. 

2.4. Boundary effect correction 

If the target / is not completely contained in the image 
domain, then the previous deconvoiution methods produce 
annoying boundary artifacts. It is not the purpose of this 
paper to discuss the different methods for solving this 
problem. We focus on an approach proposed in Bertero 
& Boccacci (|2005p for single image deconvoiution and in 
Anconelli et al. (|2006p for multiple image deconvoiution. 
Here we present the equations in the case of multiple im- 
ages, where a single image corresponds to p = 1. 

The idea is to reconstruct the object / over a domain 
broader than that of the detected images and to merge, by 
zero padding, the arrays of the images and the object into 
arrays of dimensions that enable their Fourier transform 
to be computed by means of FFT. We denote by S the 
set of values of the multi-index labeling the pixels of the 
broader arrays containing S and by R that of the object 
array contributing to 5, such that S* C i? C S*. It is also 
obvious that also the PSFs must be defined over S and 
that this can be done in different ways, depending on the 
specific problem one is considering. We point out that they 
must be normalized to unit volume over S. We also note 
that R corresponds to the part of the object contributing 
to the detected images and that it depends on the extent 
of the PSFs. It can be estimated from this information 
as we indicate in the following (see Eq. ((20)) V The recon- 
struction of / outside S, is unreliable in most cases, but 
its reconstruction inside S is practically free of boundary 
artifacts, as shown in the papers cited above and in the 
experiments of Sect. 4. 

If we denote by Mr, Ms the arrays, defined over 5, 
which are 1 over i?, S respectively and outside, we define 
the matrices Aj and 



(^,/)(m)-Ms(m)^K,(m-n)Mfl(n)/(n) , (16) 

nGS 

(Ajg)(n) = Mfl(n) ^ ^.(m - n)Ms(m)g(m) . (17) 

mGS 

In the second equation, g denotes a generic array defined 
over S. Both matrices can be easily computed by means of 
FFT. With these definitions, the data fidelity function is 
then given again by Eq. (|12p . with S replaced by S, while 
its gradient is now given by 



vJo(/;9) = E{aJi-4-^ 



leading to the introduction of the functions 



(18) 



a,(n) = ^^ ("i - n)Ms(m)g(m) , (19) 

p 

Q:(n) = aj(n) , nGS . 



These functions can be used to define the reconstruction 
domain R, since they can be either very small or zero in 
pixels of S, depending on the behavior of the PSFs. Given 
a thresholding value cr, we use the definition 



(20) 



The RL algorithm, with boundary effect correction, is then 
given by 



p(fe+i) 



Mr 



(fc) 



E- 



j=i ^jf "] 



(21) 



the quotient being zero in the pixels outside R. Similarly, 
the OSEM algorithm, with a boundary effect correction is 
given by algorithm [2] where Eq. (jl5|) is replaced by 



(22) 



As far as the SGP algorithm concerns, the boundary effect 
correction is incorporated to the scaling matrix 




diag I o mm 



a. 



(23) 



while all the other steps remain unchanged. 
3. Computational features 

The description of the SGP algorithm provided in Sec. 12.11 
indicates several ingredients on which the success of the 
recipe depends: the choice of the starting point, the selec- 
tion of the parameters defining the method, and the stop- 
ping criterion. In the following, we briefiy describe which 
choices were made in our numerical experimentation, and 
comment on the parallel implementation of the algorithm. 



3.1. Initialization 



As far as the SGP initial point f^^'^ concerns, any non- 
negative image is allowed. The possible choices imple- 
mented in our code are: 

— the null image /'•°-' = 0. 

— the noisy image g (or, in the case of multiple decon- 
voiution, the noisy image gi corresponding to the first 
PSF Ki). 

— a constant image with pixel values equal to the 
background-subtracted flux (or mean flux in the case 
of multiple deconvoiution) of the noisy data divided by 
the number of pixels. If the boundary effect correction 
is considered, only the pixels in the object array R be- 
come equal to this constant, while the remaining values 
of iS* are set to zero. In future extensions of our codes, 
the constant image will be convolved with a Gaussian 
to avoid the presence of sharp edges. 

— any input image provided by the user. 

The constant image /^"^^ was chosen for our numerical 
experiments, which is also the initial point used for RL. 
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3.2. SGP parameter setting 

Even if the number of SGP parameters is certainly higher 
than those of the RL and OSEM approaches, the huge 
amount of tests carried out in several applications has led 
to an optimization of these values, which allows the user to 
have at his disposal a robust approach without the need for 
any problem-dependent parameter tuning. In our present 
study, some of these values were fixed according to the 
original paper of Bonettini et al. (|2009p . as in the case of 
the line-search parameters /3 and 9, which were set to 10~^ 
and 0.4, respectively. In addition, most of the steplength 
parameters remained unchanged, as ao = 1.3, ri = 0.5, 
ctmax = 10^, and Ma = 3, while amin was set to 10~^. 

The main change concerned the choice of the bounds 
(Li, ^2) for the scaling matrices. While in the original pa- 
per, the choice was a couple of fixed values (10~^°, 10^°), 
independent of the data, we decided to automatically 
adapt these bounds to the input image: we performed one 
step of the RL method and tuned the parameters (Li, L2) 
according to the min/max positive values ymin/ymax of the 
resulting image according to the rule 

IF ymax/ymin < 50 THEN 

Ll = 2/min/lO; 

L2 = 2/max • 10; 
ELSE 

Ll = 2/min; 
L2 = 2/max; 
ENDIF 

3.3. Stopping rules 

As mentioned in the introduction, in many instances both 
RL and SGP must not be pushed to convergence and an 
early stopping of the iterations is required to achieve rea- 
sonable reconstructions. In our code, we introduced dif- 
ferent stopping criteria, which can be adapted by the user 
according to his/her purposes: 

— fixed number of iterations. The user can decide how 
many iterations of SGP must be done. 

— convergence of the algorithm. In such a case, a stopping 
criterion based on the convergence of the data-fidelity 
function to its minimum value is introduced. Iteration 
is stopped when 

|Jo(/^'+'^s) - Jo (/^'^; 9)1 < tol Mf^''^;g) , (24) 

where tol is a parameter that can be selected by the 
user. 

— minimization of the reconstruction error. This crite- 
rion can be used in a simulation study. If one knows 
the object / used to generate the synthetic images, 
then one can stop the iterations when the relative re- 
construction error 

f(fc) _ f\ 

Ik) ^ \J ^ J\ (25) 
I/I 



reaches a minimum value. A very frequently used mea- 
sure of error is given by the £2 norm, i.e. |-| = ||-||2 and 
this is the criterion implemented in our code. 
— the use of a discrepancy criterion. In the case of real 
data, one can use a given value of some measure of the 
"distance" between the real data and the data com- 
puted by means of the current iteration. A recently 
proposed criterion consists in defining the "discrep- 
ancy function" for p images of size N x N 

T^^'^ = ^Mf^'^;9) , (26) 

and stopping the iterations when I?'*^' = b, where b 
is a given number close to 1. Work is in progress to 
validate this discrepancy criterion with the purpose 
of obtaining rules of thumb for estimating b in real 
applications. 

The last stopping rule deserves a few comments. In 
Bertero et al. (|2010p . it is shown that, for a single im- 
age, if / is the object generating the noisy image g, then 
the expected value of Jo(/; g) is close to N'^/2. This prop- 
erty is used to select a value of the regularization parame- 
ter when the image reconstruction problem is formulated 
as the minimization of the KL divergence with the addi- 
tion of a suitable regularization term. This use is justified 
by evidence that in some important cases it provides a 
unique value of the regularization parameter. Moreover, 
it has also been shown that the quantity X'^'^^ , defined in 
Eq. (pS)) . decreases for increasing k, starting from a value 
greater than 1. Therefore, it can be used as a stopping 
criterion. Preliminary numerical experiments described in 
that paper show that it can provide a sensible stopping 
rule at least in simulation studies. 

3.4. IDL and GPU implementation 

Our implementation of the SGP algorithm was written in 
IDL, a well-known and frequently used language in astro- 
nomical environments. This data-analysis programming 
language is well-suited to work with images, using opti- 
mized built-in vector operations. Nevertheless, it is not 
intended that usability should be compromised by perfor- 
mance. 

As already shown in Ruggiero et al (|2010p . the C-I-+ 
implementation of the SGP algorithm is well-suited to par- 
allelization and good computational speedup is obtained 
exploiting the CUDA technology. The CUDA is a frame- 
work developed by NVIDIA that enables the use of graphic 
processing unit (GPU) for programming. These graphics 
cards are nowadays in many personal computers and their 
core is highly parallel, consisting of several hundreds of 
computational units. Many recent applications show that 
the increase in efficiency achieved with this technology 
is significant and its cost is much lower than that of a 
medium-sized cluster. We note that memory management 
is crucial to ensure the optimal performance when using 
GPU. The transfer speed of data from central memory 
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to GPU is much slower than the GPU-to-GPU transfer, 
hence to maximize the GPU benefits it is very important 
to reduce the CPU-to-GPU memory communications and 
retain ah the problem data in the GPU memory. 

The CUDA technology is available in IDL as part of 
GPUlib, a software library that enables GPU-accelerated 
calculations, developed by Tech-X Corporation. It has to 
be noted that the FFT routine included in the current 
version of GPUlib (1.4.4) is available only in single pre- 
cision. Results from this function differ slightly from the 
ones obtained in double precision by IDL, causing some 
numerical differences in our experiments. 

4. Results 

We now demonstrate, by means of a few numerical ex- 
periments, the effectiveness of the SGP algorithm and its 
IDL-based GPU implementation in the solution of the de- 
blurring problems described in Sect. 2. Our test platform 
consists in a personal computer equipped with an AMD 
Athlon X2 Dual-Core at 3.11GHz, 3GB of RAM, and the 
graphics processing unit NVIDIA GTX 280 with CUDA 
3.2. We consider CPU implementations of RL, SGP, and 
OSEM in IDL 7.0; the GPU implementations are devel- 
oped in mixed IDL and CUDA language by means of the 
GPUlib 1.4.4. 

The set of numerical experiments can be divided into 
two groups: single image and multiple image deconvoiu- 
tion. For each group, some tests on boundary effect cor- 
rection are included. 

4.1. Single image deconvoiution 

The first experiments are based on 256 x 256 HST images 
of the planetary nebula NGC7027, the galaxy NGC6946 
and the Crab nebula NGC19521. We use three different 
integrated magnitudes (m) of 10, 12, and 15, not corre- 
sponding to the effective magnitudes of these objects but 
introduced for obtaining simulated images with different 
noise levels. In Fig. [T] we show the three objects in the left 
panels. In the following, they are denoted nebula, galaxy 
and Crab. 

These objects arc convolved with an AO-corrected 
shown in Fig. [5] without zoom, and frequently used 
in numerical experiments. The parameters of this PSF 
(pixel size, diameter of the telescope, etc.) are not pro- 
vided. However, it has approximately the same width as 
the ideal PSF used in the third experiment reported be- 
low and simulated assuming a telescope of 8.25 m, a wave- 
length of 2.2 /xm, and a pixel size of 5 mas. 

A background of about 13.5 mag arcsec"^, correspond- 
ing to observations in K-band, is added to the blurred im- 
ages and the results are perturbed with Poisson noise and 
additive Gaussian noise with ct = 10 e~/px. According to 

^ downloaded from |http://www.mathcs. emory.edu/~nagy/ 
RestoreTools / index.html 




,1 



Fig. 1. The three objects, represented with reverse gray 
scale (left panels; from up to down nebula, galaxy and 
Crab), and the reconstructions with minimum relative 
r.m.s. error (m = 15; right panels). 




Fig. 2. The PSF used in the experiments of single image 
deconvoiution (left panel), represented with reverse gray 
scale, and the corresponding MTF (right panel). 

the approach proposed in Snyder et al. (jl994p . compen- 
sation for readout noise is obtained in the deconvoiution 
algorithms by adding the constant ~ 100 to the im- 
ages and the background. In Table [U the performances of 
RL and SGP are reported, in terms of iteration numbers 
needed to obtain the minimum relative r.m.s. error, CPU 
times, and speedups provided by the two GPU versions 
with respect to the serial ones. The reconstructions corre- 
sponding to the minimum relative r.m.s. error, in the case 
TO = 15, are shown in the right panels of Fig. [TJ 

In the second experiment, we use the same datasets 
created in the previous one to test the effectiveness of 
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Table 1. Iteration numbers, relative r.m.s. errors, computational times, and speedups of RL and SGP, provided by the 
corresponding GPU implementations, for the three 256 x 256 objects nebula, galaxy and Crab. Iterations are stopped 
at a minimum relative r.m.s. error in the serial algorithms (the asterisks denote the maximum number of iterations 
allowed) 







Nebula (m = 10) 






Galaxy (■ 


m = 10) 






Crab ( 


m = 10) 




Algorithm 


It 


Err Sec 


SpUp 


It 


Err 


Sec 


SpUp 


It 


Err 


Sec 


SpUp 


RL 


528 


0.021 41.28 


- 


10000* 


0.140 


795.3 


- 


5353 


0.128 


419.8 


- 


RL.CUDA 


528 


0.021 2.079 


19.9 


10000* 


0.140 


35.09 


22.7 


5353 


0.128 


19.45 


21.6 


SGP 


50 


0.021 4.719 




406 


0.141 


38.61 




151 


0.129 


14.28 




SGP_CUDA 


50 


0.021 0.344 


13.7 


406 


0.142 


3.313 


11.7 


151 


0.129 


1.219 


11.7 






Nebula {m = 12) 






Galaxy (■ 


m = 12) 






Crab ( 


m = 12) 




Algorithm 


It 


Err Sec 


SpUp 


It 


Err 


Sec 


SpUp 


It 


Err 


Sec 


SpUp 


RL 


124 


0.026 9.797 




3887 


0.157 


304.6 




954 


0.136 


74.83 




RL.CUDA 


124 


0.026 0.516 


19.0 


3887 


0.157 


14.50 


21.0 


954 


0.136 


3.516 


21.3 


SGP 


24 


0.026 2.344 




153 


0.159 


14.42 




52 


0.137 


4.984 




SGP_CUDA 


24 


0.026 0.203 


11.5 


153 


0.159 


1.266 


11.4 


52 


0.137 


0.406 


12.3 






Nebula (m = 15) 






Galaxy (■ 


m = 15) 






Crab ( 


m = 15) 




Algorithm 


It 


Err Sec 


SpUp 


It 


Err 


Sec 


SpUp 


It 


Err 


Sec 


SpUp 


RL 


124 


0.063 9.766 




448 


0.234 


35.14 




128 


0.172 


10.09 




RL.CUDA 


124 


0.063 0.469 


20.8 


448 


0.234 


1.594 


22.0 


128 


0.172 


0.483 


20.9 


SGP 


12 


0.060 1.250 




21 


0.234 


2.094 




10 


0.172 


1.093 




SGP_CUDA 


12 


0.060 0.109 


11.5 


21 


0.234 


0.156 


13.4 


10 


0.172 


0.093 


11.8 



Table 2. Reconstruction of nebula, galaxy, and Crab as a mosaic of the reconstructions of four subimages with 
boundary effect correction. The number of iterations is the one required for reconstructing each subdomain, while the 
reported computational time is the total time required for the 4 reconstructions. 







Nebula ( 


m = 10) 






Galaxy ( 


m = 10) 






Crab ( 


m = 10) 




Algorithm 


It 


Err 


Sec 


SpUp 


It 


Err 


Sec 


SpUp 


It 


Err 


Sec 


SpUp 


RL 


818 


0.021 


243.8 




10000* 


0.144 


2813 




4070 


0.129 


1146 




RL.CUDA 


818 


0.021 


12.16 


20.0 


10000* 


0.144 


141.5 


19.9 


4070 


0.129 


61.55 


18.6 


SGP 


96 


0.022 


35.16 




435 


0.144 


171.6 




129 


0.129 


46.42 




SGP_CUDA 


96 


0.022 


3.406 


10.3 


435 


0.148 


14.41 


11.9 


129 


0.133 


4.342 


10.7 






Nebula ( 


m = 12) 






Galaxy ( 


m = 12) 






Crab ( 


m = 12) 




Algorithm 


It 


Err 


Sec 


SpUp 


It 


Err 


Sec 


SpUp 


It 


Err 


Sec 


SpUp 


RL 


127 


0.026 


38.42 




2347 


0.160 


696.9 




696 


0.137 


196.5 




RL.CUDA 


127 


0.026 


2.108 


18.2 


2347 


0.160 


35.13 


19.8 


696 


0.137 


10.99 


17.9 


SGP 


21 


0.026 


9.563 




126 


0.161 


51.11 




53 


0.137 


19.41 




SGP_CUDA 


21 


0.026 


0.874 


10.9 


126 


0.161 


4.438 


11.5 


53 


0.137 


1.922 


10.1 






Nebula ( 


m = 15) 






Galaxy ( 


m = 15) 






Crab ( 


m = 15) 




Algorithm 


It 


Err 


Sec 


SpUp 


It 


Err 


Sec 


SpUp 


It 


Err 


Sec 


SpUp 


RL 


96 


0.064 


27.58 




297 


0.234 


89.22 




99 


0.172 


28.08 




RL.CUDA 


96 


0.064 


1.703 


16.2 


297 


0.234 


4.547 


19.6 


99 


0.172 


1.704 


16.5 


SGP 


10 


0.061 


4.234 




17 


0.236 


7.375 




9 


0.172 


3.859 




SGP_CUDA 


10 


0.061 


0.407 


10.4 


17 


0.236 


0.657 


11.2 


9 


0.172 


0.360 


10.7 



the procedure described in Sect. 12.41 for the reduction of 
boundary effects. To this aim, the 256 x 256 blurred and 
noisy images are partitioned into four partially overlap- 
ping 160 X 160 subdomains. Each one of the four partial 
images is merged, by zero-padding, in a 256 x 256 array 
that is used, together with the original 256 x 256 PSF, 
for the reconstruction of the four parts of the object by 
means of the RL and SGP algorithms with boundary ef- 
fect correction. From the four reconstructions 128 x 128, 
nonoverlapping images are extracted and the complete re- 
constructed image is formed as a mosaic of them. An ex- 
ample of the result is shown in Fig. [31 By comparing with 



the reconstruction of the full image, it is clear that the 
mosaic of the four reconstructions does not exhibit visible 
boundary effects. 

The results of this experiment for the three objects 
are reported in Table [D The reconstruction error is the 
relative r.m.s. error between the mosaic and the original 
object. By comparing with the results of Table [U we find 
that the procedure does not significantly increase the re- 
construction error. We also point out that we choose the 
number of iterations corresponding to the global minimum, 
i.e. that providing the best performance on the mosaic of 
the four reconstructions obtained in the four subdomains. 
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Fig. 3. Upper-left panel: the original nebula; upper-right 
panel: its blurred and noisy image in the case m = 10; 
lower left panel: reconstruction of the global image; lower- 
left panel: reconstruction as a mosaic of four reconstruc- 
tions of partially overlapping subdomains, using the algo- 
rithms with boundary effect correction. 



The computational time is the total time of the four re- 
constructions. 

The third experiment intends to investigate the 
speedups achievable by SGP when varying the size of the 
images. We adopt the same procedure used in Ruggiero 
et al. (2010). The original 256 x 256 objects are convolved 
with an ideal PSF (described in the second paragraph of 
Sect. 14. 1[) and perturbed with background and Poisson 
noise. Next, images with a larger size are obtained by a 
Fourier-based re-binning, i.e. the FFT of the original im- 
age is expanded by zero padding to a double-sized array 
and the zero frequency component is multiplied by four. In 
this way, the background and the noise level are approx- 
imately unchanged and no new content is introduced at 
high frequencies. In particular, no out-of-band noise is in- 
troduced and therefore the number of iterations needed to 
converge to the best solution is probably underestimated, 
since we use, for any size, the number derived in the case 
256 X 256. In this experiment, we consider only the neb- 
ula and the galaxy with two magnitudes 10 and 15. The 
original images are expanded up to a size of 2048 x 2048. 
The results are reported in Tables |2] and HI where we high- 
lighted both the speedup observed between GPU and se- 
rial implementations (labeled "Par" ) and the one provided 
by the use of SGP instead of RL (labeled "Alg"). We note 
that the computational gain achieved by the parallel ar- 
chitecture increases in proportion to the size of the image. 
As far as the speedup of SGP with respect to RL is con- 
cerned, strong problem-dependent differences in the num- 
ber of iterations required to reach the minimum errors do 
not lead to a similarly regular behavior. 



Table 3. Reconstruction of the nebula NGC7027 with 
different image sizes. 







m = 10 








Algorithm 


Size 


Err 


Sec 


SpUp 
(Par) 


SpUp 
(Alg) 




256^ 


0.051 


783.9 


- 


- 


RL 


512^ 


0.051 


4527 


- 


- 


It = 10000* 


1024^ 


0.051 


17610 


- 


- 




2048^ 


0.051 


80026 


- 


- 




256^ 


0.051 


35.63 


22.0 


- 


RL.CUDA 


512^ 


0.051 


69.77 


64.9 




It = 10000* 


1024^ 


0.051 


149.5 


118 


- 




2048^ 


0.051 


469.1 


171 


- 




256^ 


0.052 


26.14 


- 


30.0 


SGP 


512^ 


0.051 


143.6 


- 


31.5 


It = 272 


1024^ 


0.051 


554.0 


- 


31.8 




2048^ 


0.051 


2493 


- 


32.1 




256^ 


0.052 


1.797 


14.5 


19.8 


SGP_CUDA 


512^ 


0.052 


3.469 


41.4 


20.1 


It = 272 


1024^ 


0.052 


8.016 


69.1 


18.7 




2048^ 


0.052 


25.66 


97.2 


18.3 






m = 15 








Algorithm 


Size 


Err 


Sec 


SpUp 
(Par) 


SpUp 
(Alg) 




256^ 


0.068 


48.27 


- 


- 


RL 


512^ 


0.064 


278.7 


- 


- 


It = 612 


1024^ 


0.062 


1068 


- 


- 




2048^ 


0.062 


4897 


- 


- 




256^ 


0.068 


2.219 


21.8 




RL.CUDA 


512^ 


0.064 


4.109 


67.8 




It = 612 


1024^ 


0.062 


9.250 


115 






2048^ 


0.062 


29.13 


168 






256^ 


0.068 


3.016 




16.0 


SGP 


512^ 


0.064 


16.95 




16.4 


It = 31 


1024^ 


0.062 


65.22 




16.4 




2048^ 


0.061 


290.8 




16.8 




256^ 


0.068 


0.218 


13.8 


10.2 


SGP_CUDA 


512^ 


0.064 


0.421 


40.3 


9.76 


It = 31 


1024^ 


0.062 


1.063 


61.4 


8.70 




2048^ 


0.061 


3.406 


85.4 


8.55 



4.2. Multiple images 

We test the efficiency of three algorithms for multiple im- 
age deconvolution, i.e. multiple RL, OSEM, and SGP (ap- 
plied to multiple RL), by means of simulated images of 
the Fizeau interferometer LING- NIRVANA (LN, for short; 
T. Herbst et al.. 12003]) of the Large Binocular Telescope 
(LBT). The LN is in an advanced realization phase by a 
consortium of German and Italian institutions, led by the 
Max Planck Institute for Astronomy in Heidelberg. It is 
a true imager with a maximum baseline of 22.8m, thus 
producing images with anisotropic resolution: that of a 
22.8m telescope in the direction of the baseline and that 
of a 8.4m (the diameter of LBT mirrors) in the orthogo- 
nal direction. By acquiring images with different orienta- 
tions of the baseline and applying suitable deconvolution 
methods, it is possible, in principle, to achieve the resolu- 
tion of a 22.8m telescope in all directions. The LN will be 
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Table 4. Reconstruction of the galaxy NGC6946 with dif- 
ferent image sizes. 







m = 10 








Algorithm 


Size 


Err 


Sec 


SpUp 
(Par) 


SpUp 
(Alg) 




256^ 


0.293 


786.0 


- 


- 


RL 


512^ 


0.293 


4545 


- 


- 


It = 10000* 


1024^ 


0.293 


17402 


- 


- 




2048^ 


0.293 


80022 


- 


- 




256^ 


0.293 


36.64 


21.5 


- 


RL.CUDA 


512^ 


0.293 


67.94 


66.9 




It = 10000* 


1024^ 


0.293 


146.7 


119 


- 




2048^ 


0.293 


463.9 


172 


- 




256^ 


0.292 


88.72 


- 


8.86 


SGP 


512^ 


0.291 


484.3 


- 


9.38 


It = 928 


1024^ 


0.291 


1854 


- 


9.19 




2048^ 


0.291 


8386 


- 


9.54 




256^ 


0.293 


7.219 


12.3 


5.08 


SGP_CUDA 


512^ 


0.293 


11.14 


43.5 


6.10 


It = 928 


1024^ 


0.293 


25.86 


71.7 


5.67 




2048^ 


0.293 


81.02 


104 


5.73 






m = 15 








Algorithm 


Size 


Err 


Sec 


SpUp 
(Par) 


SpUp 
(Alg) 




256^ 


0.311 


114.9 


- 


- 


RL 


512^ 


0.307 


644.3 


- 


- 


It = 1461 


1024^ 


0.306 


2574 


- 


- 




2048^ 


0.306 


11689 


- 


- 




256"^ 


0.311 


5.375 


21.4 




RL.CUDA 


512^ 


0.307 


9.656 


66.7 




It = 1461 


1024^ 


0.306 


22.41 


115 






2048^ 


0.306 


68.44 


171 






256^ 


0.311 


3.672 




31.3 


SGP 


512^ 


0.308 


20.36 




31.6 


It = 38 


1024^ 


0.307 


78.20 




32.9 




2048^ 


0.306 


354.0 




33.0 




256^ 


0.311 


0.266 


13.8 


20.2 


SGP_CUDA 


512^ 


0.307 


0.531 


38.3 


18.2 


It = 38 


1024^ 


0.307 


1.344 


58.2 


16.7 




2048^ 


0.306 


4.188 


84.5 


16.3 



equipped with a detector consisting of 2048 x 2048 pixels 
with a pixel size of about 5mas, corresponding to a FOV 
of 10" X 10" for each orientation of the baseline. Since in 
K-band the resolution of a 22.8ni mirror is about 20mas, 
the detector provides an oversampling of a factor four. 

In our simulations, we use PSFs generated with the 
code LOST (Arcidiacono et al. I2004p : one of them, with 
SR = 70% and horizontal baseline, is shown in Fig. 4 to- 
gether with the corresponding MTF. Moreover, we con- 
sider two test objects: one is again the nebula NGC7027, 
with two magnitudes, 10 and 15, and size 512 x 512 (there- 
fore, the images are noisier than those of the 256 x 256 
version with the same integrated magnitude) ; the other is 
a model of an open star cluster based on an image of the 
Pleiades (star cluster, for short), consisting of 9 stars with 
magnitudes ranging from 12.86 to 15.64. These objects are 
convolved with three PSFs corresponding to three equis- 
paccd orientations of the baseline, 0°, 60°, and 120°. In 




Fig. 4. Simulated PSF of LINC-NIRVANA with SR = 
70% (left panel) and the corresponding MTF (right panel). 
The PSF is monochromatic in K-band and is the PSF of 
a 8.4m mirror (the diameter of the two mirrors of LET) 
modulated by the interfcrometric fringes. Accordingly, in 
the MTF the central disk corresponds to the band of a 
8.4m mirror while the two side disks are replicas due to 
interfcrometry. 




Fig. 5. Interfcrometric images (horizontal baseline) of the 
512 x 512 Nebula with to = 15 (left panel) and of the star 
cluster (right panel). 



the u,v plane they provide a satisfactory coverage of the 
band of a 22.8m telescope (see, for instance, Bertero et al. 
120111 a review paper where the generation of the images 
used in this paper is described in greater detail). The re- 
sults are perturbed with a background of about 13.5 mag 
arcsec"^, corresponding to observations in K-band, and 
with both Poisson and Gaussian noises (cr = 10 e~/px). 
In Fig. 5, we show one interfcrometric image of the neb- 
ula, with magnitude 15, and one interfcrometric image of 
the star cluster, both with a horizontal baseline. 

4.2.1. Diffuse objects 

Wc now provide the results obtained in the case of the neb- 
ula with two magnitudes, 10 and 15. The stopping rule is 
given again by the minimum r.m.s. error. We first con- 
sider deconvolution without correction for edge artifacts 
because the object is within the image domain. The re- 
sults are reported in Table [5] If we compare the behaviors 
of single image and multiple image RL, we find that in the 
second larger number of iterations is required, owing 

to the difficulty in combining the resolutions of the differ- 
ent images to get a unique high-resolution reconstruction. 
Moreover, the greater cost per iteration has two causes: 
the first is that the size is 256 x 256 in the single case and 
512 X 512 in the multiple image case; the second is that one 
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Table 5. Reconstruction of the nebula using three equis- 
paced 512 x 512 images. 





m = 


10 






Algorithm 


It 


Err 


Sec 


SpUp 


RL 


3401 


0.032 


4364 


- 


RL.CUDA 


3401 


0.032 


48.00 


90.9 


OSEM 


1133 


0.032 


1602 




OSEM.CUDA 


1133 


0.032 


18.59 


86.2 


SGP 


144 


0.033 


220.7 




SGP_CUDA 


144 


0.033 


3.563 


61.9 




m = 


15 






Algorithm 


It 


Err 


Sec 


SpUp 


RL 


353 


0.091 


441.5 




RL.CUDA 


353 


0.091 


4.937 


89.4 


OSEM 


117 


0.091 


165.7 




OSEM.CUDA 


117 


0.091 


2.062 


80.4 


SGP 


16 


0.087 


26.14 




SGP_CUDA 


16 


0.087 


0.546 


47.9 



Table 6. Reconstruction of the nebula as a mosaic of four 
reconstructed subimages with boundary effect correction, 
also in the case of three equispaced images. 





m 


= 10 






Algorithm 


It 


Err 


Sec 


SpUp 


RL 


2899 


0.034 


13978 


- 


RL.CUDA 


2899 


0.034 


174.2 


80.2 


OSEM 


950 


0.034 


5447 




OSEM.CUDA 


950 


0.034 


64.03 


85.1 


SGP 


160 


0.034 


873.3 




SGP_CUDA 


160 


0.034 


15.45 


56.5 




m 


= 15 






Algorithm 


It 


Err 


Sec 


SpUp 


RL 


243 


0.094 


1174 




RL.CUDA 


243 


0.094 


15.28 


76.8 


OSEM 


81 


0.094 


479.1 




OSEM.CUDA 


81 


0.094 


5.939 


80.7 


SGP 


11 


0.087 


69.88 




SGP_CUDA 


11 


0.086 


1.532 


45.6 



single image iteration requires 4 FFTs, while one multiple 
image iteration, with three images, requires 10 FFTs. 

The results confirm that the speedup provided by 
OSEM with respect to multiple RL is about 2.5 with a 
reduction by a factor 3 in the number of iterations (see 
Sect. 2.3), although the speedup provided by SGP with 
respect to OSEM of a factor between 6 and 7 is interest- 
ing. This speedup presumably decreases as the number of 
images increases, but a speedup of about 20 is provided 
by OSEM in the case of 26 images, a number that presum- 
ably will never be achieved in the case of LN. Therefore, 
one can conclude that SGP can be recommended for the 
deconvoiution of LN images. Our CUDA implementations 
provide an additional speedup of about 80/90 for RL and 
OSEM, while smaller factors arc observed for SGP. 

When testing the accuracy of the deconvoiution meth- 
ods with boundary effect correction, we follow the same 
procedure used in the single image case, i.e. the images 
are partitioned into four partially overlapping subimages, 
the methods with boundary effect correction are applied 
and the final reconstruction is obtained as a mosaic of 
the four partial reconstructions. The results are reported 
in Table IH] and confirm the results obtained in the single 
image case. 

4.2.2. Point-wise objects 

In this case, iterations are pushed to convergence and 
therefore the stopping rule is given by the condition in Eq. 
([M)) : we use different values of tol, specifically 10~^, 10~^, 
and 10~^. In order to measure the quality of the recon- 
struction, we introduce an average relative error of the 
magnitudes defined by 



av_reLer = 




where q is the number of stars (in our case q = 9) and fhj 
and nij are respectively the true and the reconstructed 
magnitudes. The results are reported in Table [T] 

We first point out that, as in the previous cases, we 
constrain the parallel codes to perform the same number 
of iterations as the serial ones. This constraint is intro- 
duced because the EFT does not have the same precision 
in the two cases, as already discussed. As a result, the two 
implementations of the same algorithm do not provide the 
same error for the same number of iterations. This effect 
presumably will be removed when a double-precision EFT 
becomes available for GPU. 

Next, we find, as expected, that the number of iter- 
ations increases with decreasing values of tol. However, 
the increase in computation time is not compensated by 
a significant decrease in the accuracy of the reconstructed 
magnitudes. For tol = 10~^, the accuracy of the estimated 
magnitudes might already be satisfactory. We observe, 
however, that with this milder tolerance the accuracy pro- 
vided by the three algorithms is not the same. Multiple 
RL and OSEM seem to be slightly more accurate. The 
accuracy of all algorithms is essentially the same for the 
smaller tolerances. 

As a final experiment, we consider the reconstruction 
of a binary with high dynamic range (Bertero et al. l201ip . 
It consists of a primary with mi = 10 (denoted as ^i) and 
a secondary with m2 = 20 (denoted as 82)- The distance 
between the two stars is 45mas (i.e. 9 pixels for the LINC- 
NIRVANA detector) and the axis of the binary forms an 
angle of 23° with the direction of the baseline of the first 
image. Three equispaced images are generated as in the 
case of the star cluster, using the same PSFs and the same 
background. 

In this experiment, we need a very small tolerance, 
i.e. tol = 10~^, in order to allow SGP to detect the faint 
secondary. The reason is presumably that SGP requires 
a projection onto the non-negative orthant, and the cxis- 
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Table 7. Reconstruction of the star cluster with three 
512 X 512 equispaced images. The error is the average rel- 
ative error in the magnitudes defined in Eq. ([27)) . 





tol -- 


= le-3 






Algorithm 


It 


Err 


Sec 


SpUp 


RL 


319 


2.39e-4 


393.4 


- 


RL.CUDA 


319 


2.38e-4 


4.641 


84.8 


OSEM 


151 


1.63e-4 


220.8 


- 


OSEM.CUDA 


151 


1.62e-4 


2.421 


91.2 


SGP 


71 


1.35e-3 


9780 


- 


SGP_CUDA 


71 


1.29e-3 


1.641 


59.6 




tol -- 


= le-5 






Algorithm 


It 


Err 


Sec 


SpUp 


RL 


1385 


6.65e-5 


1703 


- 


RL.CUDA 


1385 


6.64e-5 


19.38 


87.9 


OSEM 


675 


5.64e-5 


980.6 




OSEM.CUDA 


675 


5.64e-5 


10.75 


91.2 


SGP 


337 


5.89e-4 


455.2 




SGP_CUDA 


337 


1.79e-4 


7187 


63.3 




tol -- 


= le-7 






Algorithm 


It 


Err 


Sec 


SpUp 


RL 


7472 


5.64e-5 


9180 




RL.CUDA 


7472 


5.98e-5 


104.8 


87.6 


OSEM 


3750 


6.13e-5 


5442 




OSEM.CUDA 


3750 


5.98e-5 


59.52 


91.4 


SGP 


572 


7.37e-5 


772.6 




SGP_CUDA 


572 


7.05e-5 


12.20 


63.3 



Table 8. Reconstruction of the binary with high dynamic 
range (image size: 256 x 256). 





tol -- 


= le-7 




Algorithm 


It 


Sec 


SpUp 


RL 


30765 


6108 




RL.CUDA 


30765 


292.9 


20.9 


OSEM 


14291 


3216 




OSEM.CUDA 


14291 


156.0 


20.6 


SGP 


2073 


482.8 




SGP_CUDA 


2073 


28.59 


16.9 




Mag 


nitude 




Algorithm 


Star 


Real 


Reconstructed 


RL 


SI 


10 


10.0001 


S2 


20 


20.1841 


OSEM 


SI 


10 


10.0001 


S2 


20 


20.0919 


SGP 


SI 


10 


10.0001 


S2 


20 


20.2683 



tence of this projection can make degrade the appearance 
of the secondary. In all cases, the results reported in Table 
[8] are interesting and demonstrate that the magnitude of 
the secondary can also be estimated with a sufficient ac- 
curacy in a reasonable computation time. 



5. Discussion 

The codes of the algorithms presented and discussed in 
this paper can be freely downloade^l. A MATLAB code 
of SGP for single image deconvolution is also available at 
the same URL, which enables one to compare the IDL and 
MATLAB implementations on one's own computer. 

The paper is based on the RL algorithm and its gener- 
alizations to boundary effect correction and multiple im- 
age deconvolution being scaled gradient methods, where 
the scaling is provided by the current iterate. Therefore, it 
is possible to attempt to improve the efficiency of these al- 
gorithms in the framework of the SGP approach proposed 
in Boncttini et al. (|2009|) . As already shown in that paper, 
the SGP version of RL provides a considerable increase in 
efficiency. 

The results given in the previous section demonstrate 
that SGP allows a significant speedup of all the RL-type 
algorithms considered in this paper, even if the speedup 
depends considerably on the specific object to be recon- 
structed and, for a given object, on the noise level; it 
ranges from about 4 in the case of multiple images of 
the star cluster (Table [71), to more than 30, in the case 
of a single image of the galaxy (Table [J]). A more accurate 
investigation of the speedup achievable would require ap- 
plication to a broader data set of astronomical objects as 
well as to images with different noise levels and noise real- 
izations. In all cases, we believe that the results presented 
in this paper are sufficient to demonstrate that SGP is 
a valuable acceleration of RL-like algorithms and that in 
several cases it allows a considerable reduction in compu- 
tational time. 

The speedup provided by GPU implementation is con- 
sistent with the results reported in Ruggiero ct al. (|2010p . 
The speedup of RL-algorithms is greater than that of 
SGP-algorithms because the main computational kernel 
of RL is FFT, while SGP is also based on the computation 
of steplengths, etc. Nevertheless, the gain with respect to 
RL is very significant, in some cases, allowing us to decon- 
volve a 2048 X 2048 image in a few seconds. We analyzed 
the use of a multi-thread FFTW in our C implementa- 
tion of the SGP algorithm, obtaining an improvement in 
the performance with respect to the corresponding serial 
code, which is however definitely lower than that achieved 
with the CUDA version. 

We conclude by emphasizing that in this paper we have 
considered a maximum likelihood approach for the image 
deblurring problems and used an early stopping of the iter- 
ative procedures to mimic a regularization effect. However, 
the SGP method can also be applied to regularized decon- 
volution in the framework of a Bayesian approach. The 
main problem would then be to decide on a rule to deter- 
mine a suitable scaling for a given regularization function, 
since the scaling should depend on this function. Such a 
rule can be provided by the split-gradient method (SGM) 
proposed by Lanteri et al. ([20011 [2002| . which can be con- 



at the URL htt p : / / www . unif e . it /prin /software] 
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sidered an improvement of the one-step late (OSL) method 
proposed by Green (|1990|) : the OSL scahng does not al- 
ways yield positive values, while the SGM scaling does. 

The scaling of SGM combined with SGP has been al- 
ready tested in the case of Poisson denoising (Zanella et 
al. and Poisson deblurring (Stagliano et al. [^UTT]) . 

which are both based on edge-preserving regularization. 
In both cases, this combination leads to very efhcient al- 
gorithms. The SGM scalings can also be designed for other 
kinds of regularization, and for a discussion of the case for 
Poisson data we refer to Bertero et al. (I^OIM [^gTT]) . 

Work is in progress to develop a library of SGP algo- 
rithms for Poisson data deconvoiution with a number of 
different kinds of regularization. 
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